OCR exploration server

The Book Structure Extraction Competition with the Resurgence full content software at Caen University

Internal identifier: 000235 (Main/Exploration); previous: 000234; next: 000236

Authors: Emmanuel Giguet [France]; Nadine Lucas [France]

RBID : Hal:hal-01071717

Abstract

GREYC participated for the third time in the Structure Extraction Competition, part of the INEX/ICDAR Book track, with the Resurgence software. We used a minimal strategy based primarily on a full-content, top-down document representation with two, then three, levels: part, chapter and section. The main idea is to use a model describing the relationships between elements of the document structure. Boundaries between high-level units are detected. The periphery-center relationship is calculated on the entire document and then reflected on each page. The weak points of the approach are that the level hierarchy is implicit and dependent on named levels. It does not fit the chapter and section levels reflected in the ground truth. The strong points are that it deals with the entire document; it handles books without ToCs and extracts titles that are not represented in the ToC (e.g. the preface); it is tolerant of OCR errors and language-independent; and it is simple and fast. A test on sections was run after the competition to help understand the evaluation issues that arise with more than two levels.

DOI: 10.1007/978-3-642-35734-3_7


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">The Book Structure Extraction Competition with the Resurgence full content software at Caen University</title>
<author>
<name sortKey="Giguet, Emmanuel" sort="Giguet, Emmanuel" uniqKey="Giguet E" first="Emmanuel" last="Giguet">Emmanuel Giguet</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-388300" status="VALID">
<orgName>Equipe Hultech - Laboratoire GREYC - UMR6072</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-150" type="direct"></relation>
<relation name="UMR6072" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300358" type="indirect"></relation>
<relation active="#struct-300266" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-150" type="direct">
<org type="laboratory" xml:id="struct-150" status="VALID">
<orgName>Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen</orgName>
<orgName type="acronym">GREYC</orgName>
<desc>
<address>
<addrLine>Boulevard du Maréchal Juin - 14050 CAEN Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.greyc.fr</ref>
</desc>
<listRelation>
<relation name="UMR6072" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300358" type="direct"></relation>
<relation active="#struct-300266" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR6072" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300358" type="indirect">
<org type="institution" xml:id="struct-300358" status="VALID">
<orgName>Ecole Nationale Supérieure d'Ingénieurs de Caen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300266" type="indirect">
<org type="institution" xml:id="struct-300266" status="INCOMING">
<orgName>Université de Caen Basse-Normandie</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Caen</settlement>
<region type="region" nuts="2">Basse-Normandie</region>
</placeName>
<orgName type="university">Université de Caen Basse-Normandie</orgName>
</affiliation>
</author>
<author>
<name sortKey="Lucas, Nadine" sort="Lucas, Nadine" uniqKey="Lucas N" first="Nadine" last="Lucas">Nadine Lucas</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-388300" status="VALID">
<orgName>Equipe Hultech - Laboratoire GREYC - UMR6072</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-150" type="direct"></relation>
<relation name="UMR6072" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300358" type="indirect"></relation>
<relation active="#struct-300266" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-150" type="direct">
<org type="laboratory" xml:id="struct-150" status="VALID">
<orgName>Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen</orgName>
<orgName type="acronym">GREYC</orgName>
<desc>
<address>
<addrLine>Boulevard du Maréchal Juin - 14050 CAEN Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.greyc.fr</ref>
</desc>
<listRelation>
<relation name="UMR6072" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300358" type="direct"></relation>
<relation active="#struct-300266" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR6072" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300358" type="indirect">
<org type="institution" xml:id="struct-300358" status="VALID">
<orgName>Ecole Nationale Supérieure d'Ingénieurs de Caen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300266" type="indirect">
<org type="institution" xml:id="struct-300266" status="INCOMING">
<orgName>Université de Caen Basse-Normandie</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Caen</settlement>
<region type="region" nuts="2">Basse-Normandie</region>
</placeName>
<orgName type="university">Université de Caen Basse-Normandie</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-01071717</idno>
<idno type="halId">hal-01071717</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-01071717</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-01071717</idno>
<idno type="doi">10.1007/978-3-642-35734-3_7</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Hal/Corpus">000121</idno>
<idno type="wicri:Area/Hal/Curation">000121</idno>
<idno type="wicri:Area/Hal/Checkpoint">000073</idno>
<idno type="wicri:Area/Main/Merge">000238</idno>
<idno type="wicri:Area/Main/Curation">000235</idno>
<idno type="wicri:Area/Main/Exploration">000235</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">The Book Structure Extraction Competition with the Resurgence full content software at Caen University</title>
<author>
<name sortKey="Giguet, Emmanuel" sort="Giguet, Emmanuel" uniqKey="Giguet E" first="Emmanuel" last="Giguet">Emmanuel Giguet</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-388300" status="VALID">
<orgName>Equipe Hultech - Laboratoire GREYC - UMR6072</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-150" type="direct"></relation>
<relation name="UMR6072" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300358" type="indirect"></relation>
<relation active="#struct-300266" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-150" type="direct">
<org type="laboratory" xml:id="struct-150" status="VALID">
<orgName>Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen</orgName>
<orgName type="acronym">GREYC</orgName>
<desc>
<address>
<addrLine>Boulevard du Maréchal Juin - 14050 CAEN Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.greyc.fr</ref>
</desc>
<listRelation>
<relation name="UMR6072" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300358" type="direct"></relation>
<relation active="#struct-300266" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR6072" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300358" type="indirect">
<org type="institution" xml:id="struct-300358" status="VALID">
<orgName>Ecole Nationale Supérieure d'Ingénieurs de Caen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300266" type="indirect">
<org type="institution" xml:id="struct-300266" status="INCOMING">
<orgName>Université de Caen Basse-Normandie</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Caen</settlement>
<region type="region" nuts="2">Basse-Normandie</region>
</placeName>
<orgName type="university">Université de Caen Basse-Normandie</orgName>
</affiliation>
</author>
<author>
<name sortKey="Lucas, Nadine" sort="Lucas, Nadine" uniqKey="Lucas N" first="Nadine" last="Lucas">Nadine Lucas</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-388300" status="VALID">
<orgName>Equipe Hultech - Laboratoire GREYC - UMR6072</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-150" type="direct"></relation>
<relation name="UMR6072" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300358" type="indirect"></relation>
<relation active="#struct-300266" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-150" type="direct">
<org type="laboratory" xml:id="struct-150" status="VALID">
<orgName>Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen</orgName>
<orgName type="acronym">GREYC</orgName>
<desc>
<address>
<addrLine>Boulevard du Maréchal Juin - 14050 CAEN Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.greyc.fr</ref>
</desc>
<listRelation>
<relation name="UMR6072" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300358" type="direct"></relation>
<relation active="#struct-300266" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR6072" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300358" type="indirect">
<org type="institution" xml:id="struct-300358" status="VALID">
<orgName>Ecole Nationale Supérieure d'Ingénieurs de Caen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300266" type="indirect">
<org type="institution" xml:id="struct-300266" status="INCOMING">
<orgName>Université de Caen Basse-Normandie</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Caen</settlement>
<region type="region" nuts="2">Basse-Normandie</region>
</placeName>
<orgName type="university">Université de Caen Basse-Normandie</orgName>
</affiliation>
</author>
</analytic>
<idno type="DOI">10.1007/978-3-642-35734-3_7</idno>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="en">
<term>Data Mining and Knowledge Discovery</term>
<term>Data Storage Representation</term>
<term>Data Structures</term>
<term>Database Management</term>
<term>Information Storage and Retrieval</term>
<term>Information Systems Applications (incl. Internet)</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">GREYC participated for the third time in the Structure Extraction Competition, part of the INEX/ICDAR Book track, with the Resurgence software. We used a minimal strategy based primarily on a full-content, top-down document representation with two, then three, levels: part, chapter and section. The main idea is to use a model describing the relationships between elements of the document structure. Boundaries between high-level units are detected. The periphery-center relationship is calculated on the entire document and then reflected on each page. The weak points of the approach are that the level hierarchy is implicit and dependent on named levels. It does not fit the chapter and section levels reflected in the ground truth. The strong points are that it deals with the entire document; it handles books without ToCs and extracts titles that are not represented in the ToC (e.g. the preface); it is tolerant of OCR errors and language-independent; and it is simple and fast. A test on sections was run after the competition to help understand the evaluation issues that arise with more than two levels.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Basse-Normandie</li>
</region>
<settlement>
<li>Caen</li>
</settlement>
<orgName>
<li>Université de Caen Basse-Normandie</li>
</orgName>
</list>
<tree>
<country name="France">
<region name="Basse-Normandie">
<name sortKey="Giguet, Emmanuel" sort="Giguet, Emmanuel" uniqKey="Giguet E" first="Emmanuel" last="Giguet">Emmanuel Giguet</name>
</region>
<name sortKey="Lucas, Nadine" sort="Lucas, Nadine" uniqKey="Lucas N" first="Nadine" last="Lucas">Nadine Lucas</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000235 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000235 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Hal:hal-01071717
   |texte=   The Book Structure Extraction Competition with the Resurgence full content software at Caen University
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024